# OPENMP\* ANALYSIS IN INTEL® VTUNE™ AMPLIFIER XE

TALKING TO A USER ABOUT OPENMP\* PERFORMANCE IN THE LANGUAGE THE PROGRAM WAS WRITTEN IN, WITH A LITTLE WANDER INTO VECTORIZATION TOO

Material from Dmitry Prohorov (VTune HPC lead), Zakhar Matveev (Intel® Advisor Architect) and Alex Shinsel

Presented by Jim Cownie

#### Agenda

- VTune Amplifier XE OpenMP\* Analysis: answering customers' questions about performance in the same language their program was written in
  - Concepts, metrics and technology inside
  - VTune Amplifier XE OpenMP Analysis Workflow
- A short introduction to Intel® Advisor's Roofline Analysis
- Summary

## Typical customer questions on parallelization efficiency of OpenMP\* applications

- "I put pragmas but why is my speed up so poor?" (parallelization inefficiency)
- "I ran my app on a system with more cores but it doesn't run as efficiently as on a smaller one" (scalability issues)

Decomposing the questions:

- Is the serial time of my application significant in preventing scaling?
- How efficient is my OpenMP parallelization?
- If inefficient, how much gain can be achieved if I invest in fighting the inefficiencies?
- Which OpenMP regions/loops/barriers are worth tuning? What are their particular problems?

#### If Performance Information is OpenMP "Unaware"...
- The questions are tied to the OpenMP program structure (the `#pragma`s), so answers should be given the same way to be understandable and actionable.
- In the OpenMP "unaware" views of VTune Amplifier XE it is difficult to detect problems: customers see CPU time consumed in the OpenMP runtime and blame the runtime, not understanding that this is a result of parallelization inefficiency.

#### Overview of summary pane

### Key to OpenMP awareness in VTune: region-based views and metrics

- Definition of Region
- Potential Gain (elapsed time metric)

#### Technology used by VTune Amplifier XE OpenMP Analysis

**Tracing** of OpenMP constructs to provide region/work sharing context and precise imbalance at barriers

- Provided to VTune by LLVM/Intel OpenMP Runtime under profiling
- Fork-Join points of parallel regions with number of working threads (Intel Compiler 14 and later)
- OpenMP construct barrier points with imbalance info and OpenMP loop metadata
- `-parallel-source-info=2` Intel compiler option to embed source file name in region name

Looking at transition to OMPT, working with John M-C on interface enrichments for low overhead analysis

**Sampling** to define and classify CPU time: user's code and OpenMP RTL work

- Classification: Locking, Scheduling, Work Forking

#### VTune Amplifier XE OpenMP Analysis Workflow

- Start with HPC Performance Characterization analysis
- Explore CPU Utilization metrics related to OpenMP in summary, grid, and source views

Command line: `amplxe-cl -collect hpc-performance <my_app>`

Summary pane of the HPC Performance Characterization viewpoint (screenshot):

- CPU Utilization: 76.4%
  - Average CPU Usage: 18.344 out of 24 logical CPUs
  - Serial Time: 0.021s (0.3%)
  - Parallel Region Time: 7.784s (99.7%)
    - Estimated Ideal Time: 6.413s (82.2%)
    - OpenMP Potential Gain: 1.308s (16.8%)
- Top OpenMP Regions by Potential Gain: lists OpenMP regions with the highest potential for performance improvement. The Potential Gain metric shows the elapsed time that could be saved if the region was optimized to have no load imbalance, assuming no runtime overhead.
- CPU Usage Histogram: displays the percentage of the wall time the specific number of CPUs were running simultaneously. Spin and Overhead time adds to the Idle CPU usage value.

## Per-region Details in grid view: inefficiencies in elapsed time are classified and highlighted

#### Details in Grid View: Serial Time Hotspots

- Serial hotspots under Master Thread
- Time Filter to exclude initialization phase

#### Details on Scalable Timeline

Super tiny timeline display mode: a bird's eye view showing all data without scrolling

#### Details for a Region at source file level

#### Intel® VTune™ Amplifier Summary

- VTune Amplifier XE OpenMP analysis answers customers' questions about performance in the language of OpenMP constructs
- The analysis scales well for many-core systems with good balance of tracing and sampling collection technologies
- The full feature set is available in VTune Amplifier XE with Intel OpenMP and Intel MPI runtimes as a part of Intel® Parallel Studio XE

# ADD OPENMP SIMD WITH INTEL® (VECTOR) ADVISOR

Slides by Zakhar Matveev, Intel® Advisor Architect

## Intel Advisor: 5 tools for Efficient Vectorization and Memory utilization

# Advisor Survey: vectorize and improve SIMD code performance!

- Vectorized / Not Vectorized
- Efficiency: my performance thermometer
- Recommendations: get tips on how to improve performance, in particular using OpenMP 4.\* and later!

# Advisor Dependencies: The Answer to Tough SIMD/Threading Question #1!

Is it safe to force the compiler to vectorize?
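The kinds of answer this question can have are easier to see in C. The sketch below is an illustration only and is not taken from the slides: the function names are hypothetical, and the `restrict` qualifiers are an assumption standing in for what a Dependencies run would confirm for a given workload. It shows a loop with independent iterations marked with `omp simd`, a loop that needs the `reduction` clause, and a recurrence that is left serial because each iteration reads the value written by the previous one.

```c
#include <stddef.h>

/* Independent iterations: b[i] depends only on a[i].
   `restrict` is the programmer's promise that a and b do not overlap. */
void scale_simd(const int *restrict a, int *restrict b, int z, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        b[i] = z * a[i];
}

/* Every iteration updates `sum`; the reduction clause tells the
   compiler to keep per-lane partial sums and combine them at the end. */
long sum_simd(const int *a, size_t n)
{
    long sum = 0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* True loop-carried dependence: a[i] needs the a[i-1] written by the
   previous iteration, so the loop is deliberately left without a pragma. */
void running_product(int *a, const int *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] * b[i];
}
```

Compiled without OpenMP support the pragmas are ignored and the functions still return the same results, so the annotations can be added incrementally.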
```
DO I = 1, N
   A(I) = A(I-1) * B(I)
ENDDO
```

```
/* a and b are plain pointers and may alias, so the compiler
   cannot prove on its own that the iterations are independent. */
void scale(int *a, int *b, int z)
{
    for (int i = 0; i < 1000; i++)
        b[i] = z * a[i];
}
```

Possible outcomes of the Dependencies analysis:

- Safe to vectorize (at least for the given workload): use **OMP SIMD**!
- Use **OMP reduction** (also for threading)!
- True dependence proven: no way to parallelize without extra work

# Intel Advisor gives recommendations to guide you where and how to add OpenMP SIMD!

- Read more about the issue and what to modify in your code to fix it (to enable vector parallelism!)
- In more detail: which function is causing the problem?
- Add OpenMP SIMD: reference example
- Issue: Inefficient memory access patterns present

#### More examples: also threading aware...

- Recommendation: Remove OpenMP lock functions
- Confidence level: low

### ROOFLINE IN INTEL® ADVISOR

Slides by Alex Shinsel

#### What is a Roofline Chart?

A Roofline Chart plots application performance against hardware limitations.

- Where are the bottlenecks?
- How much performance is being left on the table?
- Which bottlenecks can be addressed, and which should be addressed?
- What's the most likely cause?
- What are the next steps?

Roofline first proposed by University of California at Berkeley: *Roofline: An Insightful Visual Performance Model for Multicore Architectures*, 2009

Cache-aware variant proposed by University of Lisbon: *Cache-Aware Roofline Model: Upgrading the Loft*, 2013

#### **Roofline Metrics**

Roofline is based on FLOPS and Arithmetic Intensity (AI).

- FLOPS: Floating-Point Operations / Second
- Arithmetic Intensity: FLOP / Byte Accessed

Collecting this information in Intel® Advisor requires two analyses.

- Roofline: shortcut to run Survey followed by Trip Counts + FLOPs.
- Survey: runs system benchmarks and collects timing data.
- Trip Counts + FLOPs: collects memory traffic and FLOP data. Must be run separately due to higher overhead that would interfere with timing measurements.

#### Classic vs. Cache-Aware Roofline

Intel® Advisor uses the Cache-Aware Roofline model, which has a different definition of Arithmetic Intensity than the original ("Classic") model.

#### **Classical Roofline**

- Traffic measured from one level of memory (usually DRAM)
- AI may change with data set size
- AI changes as a result of memory optimizations

#### Cache-Aware Roofline

- Traffic measured from all levels of memory
- AI is tied to the algorithm and will not change with data set size
- Optimization does not change AI\*, only the performance

\*Compiler optimizations may modify the algorithm, which may change the AI.

#### **Ultimate Performance Limits**

#### **Sub-Roofs and Current Limits**

These sub-roofs can be used to help diagnose bottlenecks.

#### The Intel® Advisor Roofline Interface

- Roofs are based on benchmarks run before the application.
- Roofs can be hidden, highlighted, or adjusted.
- Intel® Advisor has size- and color-coding for dots.
  - Color code by duration or vectorization status
  - Categories, cutoffs, and visual style can be modified.

#### **Identifying Good Optimization Candidates**

Focus optimization effort where it makes the most difference.

- Large, red loops have the most impact.
- Loops far from the upper roofs have more room to improve.

#### **Identifying Potential Bottlenecks**

Final roofs *do* apply; sub-roofs *may* apply.

- Roofs above indicate potential bottlenecks
- Closer roofs are the most likely suspects
- Roofs below may contribute but are generally not primary bottlenecks

#### Feature Synergy: Overcoming the Scalar Add Peak

- Survey and Code Analytics tabs indicate vectorization status with colored icons.
- "Why No Vectorization" tab and column in Survey explain what prevented vectorization.
- Recommendations tab may help you vectorize the loop.
- Dependencies determines if it's safe to force vectorization.

#### Feature Synergy: Overcoming the Vector Add Peak

- Survey and Code Analytics display the vector efficiency and presence of FMAs.
- Recommendations may help improve efficiency or induce FMA usage.

| Address | Line | Assembly |
|-------------|------|---------------------------------------------------------|
| 0x140001550 | | Block 1: 1660000000 |
| 0x140001550 | 262 | vmovupd ymm3, ymmword ptr [rsi+rcx*8+0x26400] |
| 0x140001559 | 262 | vmovdqa ymm1, ymm0 |
| 0x14000155d | 262 | vfmadd132pd ymm1, ymm3, ymmword ptr [rsi+rcx*8+0x23a80] |
| 0x140001567 | 262 | vaddpd ymm2, ymm1, ymm3 |
| 0x14000156b | 262 | vmovupd ymm1, ymmword ptr [rsi+rcx*8+0x26420] |
| 0x140001574 | 262 | vaddpd ymm4, ymm2, ymm3 |
| 0x140001578 | 262 | vmovdqa ymm5, ymm0 |
| 0x14000157c | 262 | vfmadd132pd ymm5, ymm1, ymmword ptr [rsi+rcx*8+0x23aa0] |
| 0x140001586 | 262 | vmovupd ymmword ptr [rsi+rcx*8+0x21100], ymm4 |
| 0x14000158f | 262 | vaddpd ymm5, ymm5, ymm1 |
| 0x140001593 | 262 | vaddpd ymm2, ymm5, ymm1 |
| 0x140001597 | 260 | add rcx, 0x8 |
| 0x14000159b | 262 | vmovupd ymmword ptr [rsi+rdx*8+0x21100], ymm2 |
| 0x1400015a4 | 260 | add rdx, 0x8 |
| 0x1400015a8 | 260 | cmp rcx, 0x530 |
| 0x1400015af | 260 | jb 0x140001550 \<Block 1\> |

The Assembly tab\* is useful for determining how well you are making use of FMAs.

\*Color coding added for clarity.

#### Feature Synergy: Overcoming the Memory Bandwidth Roofs

- Memory Access Patterns (MAP) identifies inefficient access patterns.
- Intel® SIMD Data Layout Templates (Intel® SDLT) allows code written as AOS to be stored as efficient SOA.
- Intel® VTune™ Amplifier can be used to further optimize cache usage.
- If cache usage cannot be improved, try re-working the algorithm to increase the AI (and slide up the roof).

#### Intel® Advisor Roofline Summary

Intel® Advisor's Roofline Chart is highly customizable and easy to generate.

Lets you identify the best optimization candidates by focusing on low, large loops.

Use the chart to identify the most likely bottlenecks.
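The rule of thumb from the bottleneck slides (roofs above a loop are candidates, the closest one is the most likely suspect) can be written down as a few lines of arithmetic. The sketch below is only an illustration of that reading of the chart and is not how Intel® Advisor computes anything: the type `roof_t`, both function names, and every number in the usage note are hypothetical. A compute roof is a flat ceiling in GFLOP/s, while a memory roof has a ceiling of bandwidth times arithmetic intensity.

```c
#include <stddef.h>

/* One roof: a peak compute rate (GFLOP/s) or a memory bandwidth (GB/s). */
typedef struct {
    const char *name;
    double value;       /* GFLOP/s for compute roofs, GB/s for memory roofs */
    int is_bandwidth;   /* nonzero for memory roofs */
} roof_t;

/* Ceiling (GFLOP/s) this roof imposes at arithmetic intensity ai (FLOP/byte). */
double roof_ceiling(const roof_t *r, double ai)
{
    return r->is_bandwidth ? r->value * ai : r->value;
}

/* Index of the lowest roof that is still at or above the measured
   performance, i.e. the most likely suspect; -1 if no roof is above. */
int closest_roof_above(const roof_t *roofs, size_t n, double ai, double gflops)
{
    int best = -1;
    double best_ceiling = 0.0;
    for (size_t i = 0; i < n; i++) {
        double c = roof_ceiling(&roofs[i], ai);
        if (c >= gflops && (best < 0 || c < best_ceiling)) {
            best = (int)i;
            best_ceiling = c;
        }
    }
    return best;
}
```

With made-up roofs of 50 GB/s (DRAM), 1000 GB/s (L1), 20 GFLOP/s (scalar add peak) and 80 GFLOP/s (vector add peak), a loop at an AI of 0.25 FLOP/byte running at 10 GFLOP/s sits just under the DRAM ceiling of 12.5 GFLOP/s. The same loop at 15 GFLOP/s is already above that ceiling, and the scalar add peak becomes the closest roof.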
Intel® Advisor's many other features allow deep analysis of suspected problems and provide advice on how to overcome them.

#### **Overall Conclusions:**

- Tuning requires analysis of what's really happening in the code
  - Our guesses are frequently wrong
- We also need to understand the potential for improvement
- Intel tools (VTune Amplifier, Advisor and Inspector) can help

Download Intel Parallel Studio XE (includes all the tools and other goodies) and try it.

- Free for students, teachers, Open Source contributors.
- Free evaluation licenses for everyone else.
- Or you can even pay money!

#### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804